
US006545981B1 

(12) United States Patent (10) Patent No.: US 6,545,981 B1

Garcia et al. (45) Date of Patent: Apr. 8, 2003 



(54) SYSTEM AND METHOD FOR 

IMPLEMENTING ERROR DETECTION AND 
RECOVERY IN A SYSTEM AREA NETWORK 

(75) Inventors: David J. Garcia, Los Gatos; Richard
O. Larson, Cupertino, both of CA
(US); Stephen G. Low; William J.
Watson, both of Austin, TX (US)

(73) Assignee: Compaq Computer Corporation, 
Houston, TX (US) 

( * ) Notice: Subject to any disclaimer, the term of this 
patent is extended or adjusted under 35 
U.S.C. 154(b) by 0 days. 

(21) Appl. No.: 09/224,115 

(22) Filed: Dec. 30, 1998 

Related U.S. Application Data

(60) Provisional application No. 60/070,650, filed on Jan. 7,
1998.

(51) Int. CI.'' H04L 1^56 

(52) U.S. Cl. 370/242; 370/248; 370/392; 714/48

(58) Field of Search 370/216, 217, 

370/225, 252, 228-230, 242, 248, 257, 
389, 394, 400, 465, 392; 340/2.1, 2.7, 3.43; 
709/107, 239, 250, 238; 714/100, 48, 21; 

379/221.01, 221.04 

(56) References Cited 

U.S. PATENT DOCUMENTS 

4,733,350 A        3/1988   Tone et al.          364/200
5,193,149 A *      3/1993   Awiszio et al.       395/200
5,386,524 A        1/1995   Lary et al.          395/400
5,555,405 A        9/1996   Griesmer et al.      395/600
5,563,879 A       10/1996   Sanders et al.       370/60
5,574,849 A       11/1996   Sonnier et al.       395/1821
5,619,274 A        4/1997   Roop et al.          348/461
5,675,579 A *     10/1997   Watson et al.        370/248
5,678,007 A       10/1997   Hurvig               395/200.12
5,706,281 A *      1/1998   Hashimoto et al.     370/352
5,737,595 A        4/1998   Cohen et al.         395/61
5,802,050 A        9/1998   Petersen et al.      370/394
5,920,886 A        7/1999   Feldmeier            711/108
5,991,797 A       11/1999   Furtral et al.       709/216
6,119,244 A        9/2000   Schoenthal et al.    714/4
6,272,591 B2       8/2001   Grun                 711/114
6,347,337 B1 *     2/2002   Shah et al.          709/224
2002/0029305 A1    3/2002   Satran et al.        710/33

FOREIGN PATENT DOCUMENTS 

EP 0 757 318 A2 2/1997 G06F/13/12 

OTHER PUBLICATIONS

Garcia et al., "ServerNet II," Parallel Computer Routing and
Communication (2nd Int. Wksp.), Jun. 26, 1997, pp.
119-135, XP002103164, Atlanta, GA.

Eicken Von T. et al., "U-Net: A User-Level Network
Interface for Parallel and Distributed Computing," Operating
Systems Review (SIGOPS), vol. 29, No. 5, Dec. 1,
1995, pp. 40-53.

Dunning D. et al., "The Virtual Interface Architecture,"
IEEE Micro, vol. 18, No. 2, 3/98, pp. 66-76.

* cited by examiner 

Primary Examiner: Huy D. Vu
Assistant Examiner: Duc Ho

(74) Attorney, Agent, or Firm — Oppenheimer, Wolff & 
Donnelly, LLP; Leah Sherry 

(57) ABSTRACT 

A system and method for facilitating both in-order and 
out-of-order packet reception in a SAN includes requestor 
and responder nodes, coupled by a plurality of paths, that 
maintain the good and bad status of each path and also 
maintain local copies of a message sequence number. If an 
error occurs for a transaction over a given path, the requestor
informs the responder, over a good path, that the given path
has failed and both nodes update their path status to indicate
that the given path is bad. A barrier transaction is used by the
requestor to determine whether the error is transient or 
permanent, and, if the error is transient, the requestor retries 
the transaction. 

22 Claims, 6 Drawing Sheets



[Front-page drawing: Fault Tolerant ServerNet II System Area Network.
Labels: Single Ported NICs; Dual Ported NICs; SNet NIC; Router Node(s);
Typical End Nodes: CPU, I/O Controller, PC, Workstation; PCI; Gigabit;
ATM; (FC-AL); Pair of controllers, each with two links.]

05/10/2004, EAST Version: 1.4.1 



U.S. Patent    Apr. 8, 2003    Sheet 1 of 6    US 6,545,981 B1



FIG. 1: ServerNet II protocol layers

End Node VIA NIC: VI Session Layer; Transaction, Packet, Link, MAC,
and Physical layers for each port.

Routing Node: Routing Layer; Packet, Link, MAC, and Physical layers
for each port.


FIG. 2: ServerNet II System Area Network

Labels: ServerNet; SNet NIC; Router Node(s); Typical End Nodes: CPU,
I/O Controller, PC, Workstation; PCI; Gigabit Ethernet; ATM; Fiber
Channel Arbitrated Loop (FC-AL).






FIG. 3: Fault Tolerant ServerNet II System Area Network

Labels: Single Ported NICs; Dual Ported NICs; SNet NIC; Pair of
controllers, each with two links [illegible] fabric; Typical End
Nodes: CPU, I/O Controller, PC, Workstation; PCI; Gigabit Ethernet;
ATM; (FC-AL).





FIG. 4: Four Logical Paths between SNet II Endnodes

SNet II Endnode A (ports SNID0 and SNID1) and SNet II Endnode B
(ports SNID0 and SNID1).






FIG. 5 



FIG. 6: Sequence Number Example

Descriptors in the Work Queue           Requestor SN     Responder SN
Send (one packet)                       0                0
Send (two packets)                      1, 2             1, 2
Send (three packets)                    3, 4, 5          3, 4, 5
RDMA Op (2500 bytes, sent unordered)    6, 6, 6, 6, 6    6, 6, 6, 6, 6
Send (one packet)                       6                6
Send (one packet)                       7                7
(after the last packet)                 8                8






FIG. 7: Path State Diagram

Requester's Path States

  OK To Use Path: This path is functional and requests may be sent
  on this path.

  Barrier: Attempt a Barrier to ensure the path is free of stale
  requests or responses.

  Disable Remote Node's Path Enable Bit: Use supervisory protocol to
  disable the path. Cannot use this VI (or BTE) until the path is
  disabled.

  Path Fenced: OK to restart VI (or BTE) using a different, valid
  path. Periodically do barriers to check if path can be reinstated.

  Wait for Remote Node to Enable Path: Wait for responder to
  [illegible]

  Transition labels: Request times out; Barrier Succeeds;
  Retry/Barrier Fails.

Responder's Path States

  Requests Allowed on this Path: i.e., set the bit in
  ReqinPathVector [illegible]

  Requests Not Allowed on this Path: i.e., set the bit in
  ReqinPathVector corresponding to this path to a zero.






FIG. 8: Error Recovery - Lost Request

Descriptors in the Work Queue: Send A (one packet); Send B (two
packets); Send C (three packets); RDMA Op D (2500 bytes, sent
unordered); Retry 2nd half of B; Retry C; Retry D.

Requestor SN: 0, 1, 2, 3, 4, 5, 6, 6, 6, 6, 6; on retry: 2, 3, 4, 5.

Responder SN: 0, 1, 2, 2, 2, 2, 2, 2, 2, 2; on retry: 3, 4, 5.






FIG. 9: Error Recovery - Lost Acknowledgment

Descriptors in the Work Queue: Send A (one packet); Send B (two
packets); Send C (three packets); RDMA Op D (2500 bytes, sent
unordered); Retry B; Retry C; Retry D.

Requestor SN: 0, 1, 2, 3, 4, 5, 6, 6, 6, 6, 6; on retry: 1, 2, 3, 4.

Responder SN: 0, 1, 2, 3, 4, 5, 6, 6, 6; during the retries: 6, 6, 6, 6.

Note on the retried packets: The responder detects the incoming
packets as being from the past and ignores them.




SYSTEM AND METHOD FOR 
IMPLEMENTING ERROR DETECTION AND 
RECOVERY IN A SYSTEM AREA NETWORK 

CROSS-REFERENCES TO RELATED
APPLICATIONS

This application claims priority from Provisional App. 
No. 60/070,650, filed Jan. 7, 1998, which is incorporated 
herein by reference. 

BACKGROUND OF THE INVENTION 

Traditional network systems utilize either a channel semantics
(send/receive) or a memory semantics (DMA) model.
Channel semantics tend to be used in I/O environments and
memory semantics tend to be used in processor environments.

In a channel semantics model, the sender does not know
where data is to be stored; it just puts the data on the channel.
On the sending side, the sending process specifies the
memory regions that contain the data to be sent. On the
receiving side, the receiving process specifies the memory
regions where the data will be stored.

In the memory semantics model, the sender directs the
data to a particular location in the memory, utilizing remote
direct memory access (RDMA) transactions. The initiator of
the data transfer specifies both the source buffer and
destination buffer of the data transfer. There are two types of
RDMA operations, read and write.

The virtual interface architecture (VIA) has been jointly 
developed by a number of computer and software compa- 
nies. VIA provides consumer processes with a protected, 
directly accessible interface to network hardware, termed a 
virtual interface. VIA is especially designed to provide low 
latency message communication over a system area network
(SAN) to facilitate multi-processing utilizing clusters of 
processors. 

A SAN is used to interconnect nodes within a distributed 
computer system, such as a cluster. The SAN is a type of
network that provides high bandwidth, low latency
communication with a very low error rate. SANs often utilize
fault-tolerant capability.

It is important for the SAN to provide high reliability and
high-bandwidth, low latency communication to fulfill the
goals of the VIA. Further, it is important for the SAN to be
able to recover from errors and continue to operate in the
event of equipment failures. Error recovery must be
accomplished without high CPU overhead associated with all
transactions. Furthermore, error recovery should not
increase the complexity for the consumer of VIA services.

SUMMARY OF THE INVENTION 

According to one aspect of the present invention, a SAN
maintains local copies of a sequence number for each data
transfer transaction at the requestor and responder nodes.
Each data transfer is implemented by the SAN as a sequence
of request/response packet pairs. An error condition arises if
a response to any request packet is not received at the
requesting node. The responder and requestor nodes are
coupled by a plurality of paths and each node maintains a
record of the good or bad status of each path. If a transaction
fails and the path is permanently bad, both nodes update
their status to indicate that the path is bad. This prevents
further transactions, including any stale requests that
are potentially still in the network, from arriving at the
destination and potentially corrupting data.




According to another aspect of the invention, if an error 
occurs on a path the requestor node implements a barrier 
transaction on the path to determine if the failure is perma- 
nent or transient. 

According to another aspect of the invention, the barrier 
transaction is performed by writing a number chosen from a 
large number space in a way that minimizes the probability 
of reusing the number in a short period of time. 

According to one aspect of the invention, the number is 
randomly chosen from a large number space.

According to another aspect of the invention, the large 
number is based on the requestor ID and an incrementing 
component managed by the requestor. 
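
The two ways of choosing a barrier number described above can be sketched as follows. This is only an illustration under stated assumptions: the 64-bit width, the 32-bit counter field, and the names random_barrier_number and CountingBarrierSource are hypothetical and do not come from the specification, which says only that the number is drawn from "a large number space".

```python
import itertools
import secrets

# Assumed width of the number space; the text does not give one.
BARRIER_BITS = 64


def random_barrier_number() -> int:
    """Pick a barrier value at random from the assumed 64-bit space."""
    return secrets.randbits(BARRIER_BITS)


class CountingBarrierSource:
    """Barrier values built from a requestor ID plus an incrementing part.

    The field layout (ID in the high bits, counter in the low bits) is an
    assumption made for this sketch.
    """

    def __init__(self, requestor_id: int, counter_bits: int = 32):
        self.requestor_id = requestor_id
        self.counter_bits = counter_bits
        self._counter = itertools.count()

    def next(self) -> int:
        # Wrap the counter so it always fits in its assumed field.
        count = next(self._counter) % (1 << self.counter_bits)
        return (self.requestor_id << self.counter_bits) | count
```

Either approach makes it unlikely that a value is reused within a short period, which is the property the barrier transaction relies on.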

According to another aspect of the invention, if the failure 
is transient the requestor retransmits packets starting with 
the packet that first caused an error condition to be detected. 

According to another aspect of the invention, a sequence 
number is included in each request packet and copied into 
each response packet. A local copy of the sequence number 
is maintained at the requestor and responder nodes. If the 
sequence number in the request packet does not match the 
sequence number at the responder, a negative acknowledge 
response packet is generated. 
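
The sequence number check described above can be sketched as follows. The class and method names are hypothetical, and the choice to advance the responder's local copy by one on every match is an assumption that holds only for ordered packets; sequence number width and wraparound are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Responder:
    """Responder-side copy of the sequence number (hypothetical model)."""

    expected_sn: int = 0

    def handle_request(self, request_sn: int) -> tuple[str, int]:
        """Return (response type, sequence number copied from the request)."""
        if request_sn != self.expected_sn:
            # Mismatch: generate a negative acknowledge response.
            return ("NACK", request_sn)
        # Match: accept the packet and advance the local copy (assumed).
        self.expected_sn += 1
        return ("ACK", request_sn)
```

In both cases the response carries the sequence number copied from the request, so the requestor can tell which request the response answers.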

Other features and advantages of the invention will be 
apparent in view of the following detailed description and 
appended drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram depicting ServerNet protocol 
layers implemented by hardware, where ServerNet is a SAN 
manufactured by the assignee of the present invention; 

FIGS. 2 and 3 are block diagrams depicting SAN topolo- 
gies; 

FIG. 4 is a schematic diagram depicting logical paths 
between end nodes of a SAN; 

FIG. 5 is a schematic diagram depicting routers and links 
connecting SAN end nodes; 

FIG. 6 is a graph depicting the transmission of request and 
response packets between a requestor and a responder end 
node. FIG. 6 shows the sequence numbers used in packets 
for three Send operations, an RDMA operation, and two 
additional Send operations. The diagram shows the 
sequence numbers maintained in the requestor logic, the 
sequence number contained in each packet, and the sequence 
numbers maintained at the responder logic; 

FIG. 7 is two interlocked state diagrams showing the state 
that software on the requestor and responder moves through 
for each path; 

FIG. 8 is a graph depicting retransmission during error 
recovery due to a lost request packet; and 

FIG. 9 is a graph depicting retransmission during error 
recovery due to a lost acknowledgment packet. 
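
The sequence numbering that FIG. 6 depicts can be reproduced with a short sketch. It assumes the 512-byte maximum payload given in the packet description, so the 2500-byte RDMA operation is split into five packets, and it assumes that only ordered packets advance the sequence number. The function name and the (packet count, ordered) tuple format are hypothetical.

```python
import math

MAX_PAYLOAD = 512  # bytes per packet, taken from the packet description


def packet_sequence_numbers(descriptors, start_sn=0):
    """Return (sequence number carried by each packet, next sequence number).

    descriptors is a list of (packet_count, ordered) tuples.
    """
    sn = start_sn
    carried = []
    for packet_count, ordered in descriptors:
        for _ in range(packet_count):
            carried.append(sn)
            if ordered:
                # Only ordered packets advance the sequence number.
                sn += 1
    return carried, sn


rdma_packets = math.ceil(2500 / MAX_PAYLOAD)
work_queue = [
    (1, True),              # Send (one packet)
    (2, True),              # Send (two packets)
    (3, True),              # Send (three packets)
    (rdma_packets, False),  # RDMA Op (2500 bytes, sent unordered)
    (1, True),              # Send (one packet)
    (1, True),              # Send (one packet)
]
sns, next_sn = packet_sequence_numbers(work_queue)
```

With this work queue the packets carry 0 through 5, then 6 five times for the unordered RDMA packets, then 6 and 7 for the last two Sends, leaving both nodes at 8.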

DESCRIPTION OF THE EMBODIMENTS

The preferred embodiments will be described as implemented
in the ServerNet II (ServerNet) architecture, manufactured
by the assignee of the present invention, which is a
layered transport protocol for a System Area Network
(SAN) optimized to support the Virtual Interface (VI)
architecture session layer, which has stringent user-space to
user-space latency and bandwidth requirements. These
requirements mandate a reliable hardware (HW) message
transport solution with minimal software (SW) protocol
stack overhead. The ServerNet II protocol layers for an end
node VI Network Interface Controller/Card (NIC) and for a
routing node are illustrated in FIG. 1. A single NIC and VI
session layer may support one or two ports, each with its
associated transaction, packet, link-level, MAC (media
access) and physical layer. Similarly, routing nodes with a
common routing layer may support multiple ports, each with
its associated link-level, MAC and physical layer.

Support for two ports enables the ServerNet II SAN to be
configured in both non-redundant and redundant (fault
tolerant, or FT) SAN configurations as illustrated in FIG. 2
and FIG. 3. On a fault tolerant network, a port of each end
node may be connected to each network to provide continued
VI message communication in the event of failure of one
of the SANs. In the fault tolerant SAN, nodes may be ported
into a single fabric or single ported end nodes may be
grouped into pairs to provide duplex FT controllers. The
fabric is the collection of routers, switches, connectors, and
cables that connect the nodes in a network.

The following describes general ServerNet II terminology
and concepts. The use of the term "layer" in the following
description is intended to describe functionality and does not
imply gate level partitioning.

Two ports are supported on a NIC for both performance
and fault tolerance reasons. Both of these ports operate
under the same session layer VI engine. That is, data may
arrive on any port and be destined for any VI. Similarly, the
VIs on the end node can generate data for any of these ports.

ServerNet II packets consist of a series of data
symbols followed by a packet framing command. Other
commands, used for flow control, virtual channel support,
and other link management functions, may be embedded
within a packet. Each request or response packet defines a
variety of information for routing, transaction type,
verification, length and VI specific information.

i. Routing in the ServerNet II SAN is destination-based
using the first 3 bytes of the packet. Each NIC end node
port in the network is uniquely defined by a 20 bit Port
SNID (ServerNet Node ID). The first 3 bytes of a
packet contain the Destination port's SNID or DID
(destination port ID) field, a three bit Adaptive Control
Bits (ACB) field and the fabric ID bit. The ACB is used
to specify the path (deterministic or link-set adaptive)
used to route the packet to its destination port as
described in the following section.

ii. The transaction type fields define the type of session 
layer operation that this ServerNet II packet is carrying 
and other information such as whether it is a request or 
a response and, if a response, whether it is an ACK 
(acknowledgment) or a NACK (negative 
acknowledgment). The ServerNet II SAN also supports 
other transaction types. 

iii. Transaction verification fields include the source port
ID (SID) and a Transaction Serial Number. The
transaction serial number enables a port with multiple
requests outstanding to uniquely match responses to
requests.

iv. The Length field consists of an encoding of the number
of bytes of payload data in the packet. Payloads up to
512 bytes are supported and code space is reserved for
future increases in payload size.

v. The VI Session Layer specific fields describe VI
information such as the VI Operation, the VIA
Sequence number and the Virtual Interface ID number.
The VI Operation field defines the type of VI
transaction being sent (Send, RDMA Read, RDMA Write) and
other control information such as whether the packet is
ordered or unordered, whether there is immediate data
and/or whether this is the first or last packet in a session
layer multi-packet transfer. Based on the VI transaction
type and control information, a 32 bit Immediate data
field or a 64 bit Virtual address may follow the VI ID
number.

vi. The payload data field carries up to 512 bytes of data 
between requesters and responders and may contain a 
pad byte. 

vii. The CRC field contains a checksum computed over 
the entire packet. 
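
As an illustration of item i, the sketch below packs and unpacks the three routing bytes. The text gives the field widths (a 20 bit DID, a three bit ACB field, and one fabric ID bit) but not their positions, so the bit order used here (DID in the high bits, then ACB, then the fabric bit, most significant byte first) is purely an assumption, as are the names RoutingHeader, pack_routing_header, and unpack_routing_header.

```python
from typing import NamedTuple


class RoutingHeader(NamedTuple):
    did: int     # 20-bit destination port ID
    acb: int     # 3-bit Adaptive Control Bits
    fabric: int  # 1-bit fabric ID


def pack_routing_header(did: int, acb: int, fabric: int) -> bytes:
    """Pack the three fields into 3 bytes using the assumed bit layout."""
    if not (0 <= did < (1 << 20) and 0 <= acb < 8 and fabric in (0, 1)):
        raise ValueError("field out of range")
    word = (did << 4) | (acb << 1) | fabric
    return word.to_bytes(3, "big")


def unpack_routing_header(first3: bytes) -> RoutingHeader:
    """Decode the first 3 bytes of a packet using the assumed bit layout."""
    word = int.from_bytes(first3[:3], "big")
    return RoutingHeader(did=word >> 4, acb=(word >> 1) & 0b111, fabric=word & 1)
```

A router only needs these three bytes to select an outgoing link, which is why they sit at the front of the packet.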

Transaction Overview 

The basic flow of transactions through the ServerNet II
SAN will now be described. VI requires the support of Send,
RDMA read and RDMA write transactions. These are
translated by the VI session layer into a set of ServerNet II
transactions (request/response packet pairs). All data
transfers (e.g., reading a disk file to CPU memory, dumping large
volumes of data from a disk farm directly over a high-speed
communications link, one end node simply interrupting
another) consist of one or more such transactions.

Creating a Request Packet 

The VI User Agent provides the low level routines for 
VIA Send, RDMA Write, and RDMA Read operations. 
These routines place a descriptor for the desired transfer in 
the appropriate VI queue and notify the VIA hardware that
the descriptor is ready for processing. The VIA hardware 
reads the descriptor, and based on the descriptor contents, 
builds the ServerNet request packet header and assembles 
the data payload (if appropriate). 

Dual Ports and Ordering 

In a NIC with two ports, it is possible for a single VIA
interface to process Sends and RDMA operations from
several different VIs in parallel. It is also possible for a large
RDMA transfer from a single VI to be transferred on both of
the ports simultaneously. This latter feature is called
Multipathing.

ServerNet II end nodes can connect both their ports to a
single network fabric so that there are up to four possible
paths between ServerNet II end nodes. Each port of a single
end node may have a unique ServerNet ID (SNID). FIG. 4
depicts the four possible paths that End node A can use when
sending requests to End node B:

1) End node A SNID[0] to End node B SNID[0]

2) End node A SNID[0] to End node B SNID[1]

3) End node A SNID[1] to End node B SNID[0]

4) End node A SNID[1] to End node B SNID[1]
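
The four paths listed above, together with the good or bad status that each node records per path, can be modeled with a small table. This is a sketch only: the class name PathTable, its methods, and the choice to start every path as good are assumptions made for illustration.

```python
from itertools import product


class PathTable:
    """Good/bad status for every (source port, destination port) pair."""

    def __init__(self, src_ports=(0, 1), dst_ports=(0, 1)):
        # Every path starts out good; this initial state is an assumption.
        self.good = {path: True for path in product(src_ports, dst_ports)}

    def mark_bad(self, src_port: int, dst_port: int) -> None:
        """Record that the given path has failed."""
        self.good[(src_port, dst_port)] = False

    def usable_paths(self):
        """Return the paths still marked good, in enumeration order."""
        return [path for path, ok in self.good.items() if ok]
```

With two ports on each end node the table holds the four entries of FIG. 4, and marking one path bad leaves three candidates for a retry on a different path.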

FIG. 5 depicts a network topology utilizing routers and
links. In FIG. 5, end nodes A-F, each having first and second
send/receive ports 0 and 1, are coupled by a ServerNet
topology including routers R1-R5. Links are represented by
lines coupling ports to routers or routers to routers. A first
adaptive set (fat pipe) 2 couples routers R1 and R3 and a
second adaptive set (fat pipe) 4 couples routers R2 and R4.

Routing may be deterministic or link set adaptive. An 
adaptive link-set is a set of links (also called lanes) between
two routers that have been grouped to provide higher 
bandwidth. The Adaptive Control Bits (ACB) specify which 
type of routing is in effect for a particular packet. 

Deterministic routing preserves strict ordering for packets 
sent from a particular source port to a destination port. In 
deterministic routing, the ACB field selects a single path or 




lane through an adaptive link-set. Send transactions for a
particular VI require strict ordering and therefore use
deterministic routing.

RDMA transactions, on the other hand, may make use of
all possible paths in the network without regard for the
ordering of packets within the transaction. These transactions
may use link-set adaptive routing as described below.
The ACB field specifies which specific link (or lane) in this
link-set is to be used for deterministic routing.
Alternatively, the ACB field can specify link-set adaptivity,
which enables the packets to dynamically choose from
any of the links in the link-set.

A sample topology with several different examples of
multipathing using link and path adaptivity is shown in FIG.
5. Multipathing allows large block transfers done with
RDMA Read or Write operations to simultaneously use both
ports as well as adaptive links between the two communicating
NICs. Since the data transfer characteristics of any
one VI are expected to be bursty, multipathing allows the
end node to marshal all its resources for a single transfer.
Note that multipathing does not increase the throughput of
multiple Send operations from one VI. Sends from one VI
must be sent strictly ordered. Since there are no ordering
guarantees between packets originating from different ports
on a NIC, only one port may be used per Send. Furthermore,
only a single ordered path through the network may be used,
as described in the following.

Transaction and Packet Layers

The transaction layer builds the ServerNet II request
packet by filling in the appropriate SID, Transaction Serial
Number (TSN), and CRC. The SID assigned to a packet
always corresponds to the SNID of the port the packet
originates from. The TSN can be used to help the port
manage multiple outstanding requests and match the resulting
responses uniquely to the appropriate request. The CRC
enables the data integrity of the packet to be checked at the
end node and by routers en route.

Following the ServerNet II link protocol, the packet is
encoded in a series of data symbols followed by a status
command. The ServerNet II link layer uses other commands
for flow control and link management. These commands
may be inserted anywhere in the link data stream, including
between consecutive data symbols of a packet. Finally, the
symbols are passed through the MAC layer for transmission
on the physical media to an intermediate routing node.

Routing

The routing control function is programmable so that the
packet routing can be changed as needed when the network
configuration changes (e.g., route to new end nodes). Router
nodes serve as crossbar switches; a packet on any incoming
(receive) side of a link can be switched to the outgoing
(transmit) side of any link. As the incoming request packet
arrives at a router node, the first three bytes, containing the
DID and ACB fields, are decoded and used to select a link
leading to the destination node. If the transmit side of the
selected link is not busy, the head of the packet is sent to the
destination node whether or not the tail of the packet has
arrived at the routing node. If the selected link is busy with
another packet, the newly arrived packet must wait for the
target port to become free before it can pass through the
crossbar.

As the tail of the packet arrives, the router node checks the
packet CRC and updates the packet status (good or bad). The
packet status is carried by a link symbol TPG (this packet
good) or TPB (this packet bad) appended at the end of the
packet. Since packet status is checked on each link, a packet
status transition (good to bad) can be attributed to a specific
link. The packet routing process described above is repeated
for each router node in the selected path to the destination
node.

Receiving a Request Packet

When the request packet arrives at the destination node,
the ServerNet II interface receiver checks its validity (e.g.,
must contain correct destination node ID, the length is
[illegible] valid request, and CRC must be good). If the packet is
invalid for any reason, the packet is discarded. The ServerNet II
interface may save error status for evaluation by
[illegible] checks are made. Specifically, if the request specifies an
RDMA Read or Write, the address is checked to ensure
access has been enabled for that particular VI. Also, the
input port and [illegible] of the packet are checked to ensure
access to the particular VI is allowed on that input port from
the particular source. If the packet is valid, the request can
[illegible]

Response Packet

A response is created based on the success (ACK
response) or failure (NACK response) of the request packet.
A successful read request, for example, would include the
[illegible] response. The source node ID from the
request packet is used as the destination node ID for the
response packet. The response packet must be returned to
[illegible] port. [illegible] response is not
necessarily the reverse of the path taken by the request.
The network may be configured so that responses take very
different paths than requests. If strict ordering is not
required, the response, like the request, may use link-set
adaptivity. The response packet is routed back to the SNID
[illegible]
request packet is also duplicated for the response packet.

The response can be matched with the request using the
[illegible] the packet validity checks. If an ACK response
passes these tests, the transaction layer passes the response
to [illegible] resources associated with the
request, and reports the transaction as complete. If a NACK
response passes these tests, the end node reports the failure
of the transaction to the session layer. If a valid ACK/NACK
response is not received within the allotted time limit, a
time-out error is reported.

[illegible] stream many strictly ordered ServerNet
II messages onto the wire before receiving an acknowledgment
[illegible] allows the requestor to
[illegible] packets outstanding per VI.
The hardware can operate in one of two modes with
respect to generating multiple outstanding request packets:

1. The hardware can stream packets from the same VI
send queue onto the wire, and start the next descriptor before
receiving all the acknowledgments from the current descriptor.
This is referred to as "Next Descriptor After Launch" or
NDAL.

2. The hardware can stream packets from a single descriptor
onto the wire but wait for all the outstanding acknowledgments
to complete before starting the next descriptor. This is
referred to as "Next Descriptor After ACK" or NDAA.
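
The difference between the NDAL and NDAA modes can be stated as a single predicate over per-descriptor counters. The sketch below is an illustration only; the Mode enumeration, the function name, and the three counters are hypothetical and are not taken from the hardware description.

```python
from enum import Enum


class Mode(Enum):
    NDAL = "Next Descriptor After Launch"
    NDAA = "Next Descriptor After ACK"


def may_start_next_descriptor(mode: Mode, packets_launched: int,
                              packets_total: int, acks_received: int) -> bool:
    """Decide whether the next descriptor on this VI may be started."""
    all_launched = packets_launched == packets_total
    if mode is Mode.NDAL:
        # NDAL: move on as soon as the last packet is on the wire.
        return all_launched
    # NDAA: additionally wait until every packet has been acknowledged.
    return all_launched and acks_received == packets_total
```

Under NDAL the next descriptor can overlap with acknowledgments that are still in flight, while NDAA serializes descriptors at the cost of an idle period between them.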
The choice of NDAL or NDAA modes of operation is
determined by how strongly ordered the packets are generated.

Ordered and unordered messages may be mixed on a
single VI. When generating an unordered message, the
requestor must wait for completion of all acknowledgments
to unordered packets before starting the next descriptor.

Ordering of Send Packets Presented to Transaction
Layer

The VI architecture has no explicit ordering rules as to
how the packets that make up a single descriptor are ordered
among themselves. That is, VIA only guarantees the message
ordering the client will see. For example, VIA requires
that Send descriptors for a particular VI be completed in
order, but the VIA specification does not say how the packets
will proceed on the wire.

The ServerNet II SAN requires that all Send packets
destined for a particular VI be delivered by the SAN in strict
order. As long as deterministic routing is used, the network
assures strict ordering along a path from a particular source
node to a particular destination node. This is necessary
because the receiving node places the incoming packets into
a scatter list. Each incoming packet goes to a destination
determined by the sum total of bytes of the previous packets.
The strict ordering of packets is necessary to preserve
integrity of the entire block of data being transferred because
incoming packets are placed in consecutive locations within
the block of data. Each packet has a sequence number to
allow the receiver to detect an out of order, missing, or
repeated packet.

There are two ways for an end node to meet these ordering
requirements:

a. The end node can wait for the acknowledgment from
each Send packet to complete before starting another Send
packet for that VI. By waiting for each acknowledgment, the
end node does not have to worry about the network providing
strict ordering and can choose an arbitrary source port,
adaptive link set and destination port for each message.

b. The end node can restrict all the Send operations for a
given VI to use the same source port, the same destination
port, and a single adaptive path. By choosing only one path
through the network, the end node is guaranteed that each
Send packet it launches into the network will arrive at the
destination in order.

The second approach requires the VIA end node to
maintain state per VI that indicates which source port

RDMA request packets may be sent ordered or unordered.
A bit in the packet header is set to 1 for ordered packets and
is set to 0 for unordered packets. As will be explained later,
this bit is used by the responder logic to determine if it
should increment its copy of the expected sequence number.
Sequence numbers do not increment for unordered packets.
[illegible] different source ports, destination
ports and adaptive paths for the packets. This freedom can
be exploited for a performance gain through multipathing:
simultaneously sending the RDMA packets of a single
message across multiple paths.

[illegible] sent over a path
[illegible] ordering with the Send packets
[illegible] launching
packets for the following message. The next message cannot
be started until the last acknowledgment of the RDMA Read
or Write operation successfully completes.

[illegible] multipathing is used to generate
[illegible] Write requests, the hardware must operate
in NDAA mode. This ensures the RDMA Read or Write
is completed before moving on to subsequent descriptors.

An end node may choose to send RDMA packets strictly
ordered. This can be advantageous for smaller RDMA
transfers as the hardware can operate in NDAL mode. The
VI can proceed to the next descriptor immediately after
launching the last packet of a message that is sent strictly
ordered (and hence used incrementing sequence numbers).

[illegible] Responder

The ServerNet II end node must respond to incoming
Send requests and RDMA Write requests from a particular
VI in strict order, and must write these packets to memory
in strict order.

The ServerNet II end node must also respond to incoming
RDMA Read requests from a particular VI in strict order.

Because response packets are transported by the network
in strict order, the requestor will receive all incoming
response packets for a particular VI in the same order as that
in which the corresponding requests were generated.

[illegible] Sequence Numbers

The ServerNet SAN uses acknowledgment packets to
inform the requestor that a packet completed successfully.
Sequence numbers in the packets (and acknowledgments)
are used to allow the sender to support multiple outstanding
requests to ensure adequate performance and to be able to

destination port and adaptive path is currently in use for that recover from errors occurring in the network, 

particular VI. Furthermore, the second approach allows the FIG. 6 is a graph depicting the generation, checking, and 

hardware to process descriptors in the higher performance updating of VIA sequence numbers at requestor and 

NDAL mode. responder nodes. In FIG. 6, time iuCTeases in the downward 

With the second approach, Send packets from a single VI direction. Requests are indicated by solid arrows directed to 

can stream onto the network without waiting for their 55 the right and responses by dotted arrows directed to the left, 

accompanying acknowledgments. An incrementing _ kt i_ t ■ • i- • 

sequence number is used so the destination node can detect Sequence Number Imliahzation 

missing, repeated, or unordered Send packets. The requestor and responder logic each maintain an 8 bit 

sequence numbered for each VI in use. When the VI is 

Ordenng of RDMA Packets ^ created, the requester on one node and the respond on the 

RDMA operations have slightly different ordering remote node initialize their sequence numbers to a common 

requirements than Send operations. An RDMA packet con- value, zero i the preferred embodiment, 

tains the address to which the destination end node writes the After this, the requester places its sequence number into 

packet contents. This allows multiple RDMA packets within each of the outgoing request packets. As depicted in FIG. 6, 

an RDMA message to complete out of order. The contents of 65 the sequence number, SEQ, is included in each request 

each packet are written to the correct place in the end node's packet. The responder compares the sequence number from 

memory, regardless of the order in which they complete. the incoming request packet with the responder's local copy. 
The responder uses this comparison to determine if the packet is valid, if it is a duplicate of a packet already received, or if it is an out-of-sequence packet. An out-of-sequence packet can only happen if the responder missed an incoming packet. The responder can choose to return a 'sequence error NACK packet' or it can simply ignore the out-of-sequence packet. In the latter case, the requester will have a time-out on the request (and presumably on the packet the responder missed) and initiate error recovery. Generating a sequence error NACK packet is preferred as it forces the requester to start error recovery more quickly.

The following describes how the sequence numbers are generated and checked.

Generating Sequence Numbers for Request Packets

When transmitting ordered packets (i.e. transfers are on a specific source port to a specific destination port and the ACB specifies a specific lane) the request sequence number is incremented after each packet is sent. When transmitting unordered packets (i.e. multipathing is used and/or the ACB bits specify full link set adaptivity) the request sequence number is not incremented after such a packet is sent.

For example, in FIG. 6, during the first two Send transactions, the local copy of the request sequence number is incremented after the packet is sent (Rqst. SN=0 to 6). For the RDMA operation, which sends 2500 bytes unordered, the requester does not increment the local copy of the request sequence number (Rqst. SN=6). The requester does not increment the local copy of the SN until after the first packet of the Send following the RDMA is transmitted.

Send packets are typically sent fully ordered lest the requestor have to wait for an acknowledgment for each packet before proceeding to the next. On the other hand, RDMA packets may be sent either ordered or unordered. To take advantage of multipathing, a requestor must use unordered RDMA packets.

The sender guarantees to never exceed the windowsize number of packets outstanding per VI. If S is the number of bits in the sequence number, then the windowsize is 2**(S- [illegible]

A packet is outstanding until it and all its predecessors are acknowledged. The requestor does not mark a descriptor done until all packets requested by that descriptor are positively acknowledged.

Checking Sequence Numbers on Incoming Request Packets

The destination node responding to the incoming request packet checks each incoming request packet to verify its sequence number against the responder's local copy.

The responder logic compares its sequence number with the packet's sequence number to determine if the incoming packet is either:

the expected packet it is looking for (i.e., the packet's sequence number is the same as the sequence number maintained by the responder logic), in which case the responder processes the packet and if all other checks are passed, the packet is Acknowledged and committed to memory. If the transaction is ordered, the responder increments its sequence number. If the transaction is unordered, the responder does not increment its sequence number;

an out-of-sequence packet (which means an earlier incoming packet must have gotten lost), beyond the one it is looking for, in which case the responder NACKs the packet and throws it away. The receive logic in the VI is not stopped and the responder does not increment its sequence number; or

a duplicate packet (which is being resent because the requestor must not have received an earlier ACK) in which case the responder ACKs the packet and throws it away. If the request had been an RDMA Read, the responder completes the read operation and returns the data with a positive acknowledgment.

An example of the responder checking sequence numbers for ordered and unordered packets is given in FIG. 6. In FIG. 6, during the first two Send transactions, the responder checks that the SEQ in the packet matches the local copy of Rsp. SN. Since the Send packets include ACB indicating ordered packets, the Rsp. SN is incremented after each response packet is transmitted. At the end of the first two Send transactions, Rqst. SN and Rsp. SN both equal 6. The packets for the RDMA include an ACB indicating unordered [illegible]. Neither the requestor nor the responder increments its local copy of SN. Thus, at the end of the RDMA transaction both Rqst. SN and Rsp. SN=6. The first packet of the subsequent Send transaction has SEQ=6 and SEQ matches the local copy of Rsp. SN. Since Send packets are ordered, the responder increments its local copy of Rsp. SN.

Sequence Numbers on Response Packets

When generating either a positive or negative acknowledgment, the responder logic copies the incoming sequence number and uses it in the sequence number field of the acknowledgment.

The requestor logic matches incoming responses with the originating request by comparing the SourceID, VI number, Sequence number, transaction type, and transaction Serial Number (TSN) with that of the originating request.

Error Recovery and Path State

Error recovery is initiated by the requesting node whenever the requestor fails to get a positive acknowledgment for each of its request packets. A time-out or NACK indicating a sequence number error can cause the requestor's Kernel [illegible] recovery.

Error recovery involves three basic steps:

1) Completing a barrier operation(s) to flush out any errant request or response packets.

2) Disabling a bad path if the barrier operation failed.

3) Retransmitting from the earliest packet that had failed.

The first two steps will now be described with reference to FIG. 7, which is two interlocked state diagrams showing the state that software on the requestor and responder moves through for each path. In FIG. 7, dashed lines represent Kernel to Kernel Supervisory Protocol messages that modify the remote node's state.

The ServerNet architecture allows multiple paths between end nodes. The requestor repeats these two basic steps on each path until the packet is transmitted successfully.

The requestor and responder SW each maintain a view of the state of each path. The requestor uses its view of the path state to determine which path it uses for Send and RDMA operations. The responder uses its view of the path state to determine which input paths it allows incoming requests on.

The responder logic maintains a four bit field (ReqInPathVector) for each VI in use. Each of the four bits corresponds to one of the four possible paths between the requestor's two ports and the responder's two ports. The requestor only
accepts incoming requests from a particular source or destination port if the corresponding bit in the ReqInPathVector is set.

The requestor and responder communicate using the kernel-to-kernel Supervisory protocol to communicate path state changes.

The requester's view of the path state transitions from good to bad whenever the requestor fails to get an acknowledgment (either positive or negative) to a request. The requestor detects the lack of an ACK or NACK by getting a time-out error. The requestor can attempt a barrier operation on the path to see if the failure is permanent or transient. If the barrier succeeds, the path is considered good and the original operation can be retried. If the barrier fails, the requestor must resort to a different good path.

Before the requestor can try a different good path, the requestor must inform the destination that the original path is bad. This is done by any path possible. For example, in VIA the Kernel Agent to Kernel Agent Supervisory Protocol is used. After the destination is informed the path is bad, the destination disables a bit in a four bit field (ReqInPathVector), thereby ignoring incoming requests from that path. The requester then stops using the bad path until a subsequent barrier transaction determines that the path is good. After the destination acknowledges the supervisory protocol message, indicating that the destination has disabled requests from the offending path, the requestor is free to retry the message on a different path.

After a time-out error, the requester attempts to bring the path back to a useful state by completing a barrier operation. The barrier operation ensures there are no other packets in any buffer that might show up later and corrupt the data transfer.

Barrier operations are used in error recovery to flush any stale request or stale response packets from a particular path in the SAN. A path is the collection of ServerNet links between a specific port of two end nodes.

A VIA barrier operation is done with an RDMA Write followed by an RDMA Read. A number chosen from a large number space (either incrementing or pseudo random) is written to a fixed location (e.g. a page number agreed to, a priori, by the kernel agent-to-kernel agent Supervisory Protocol and either a fixed or random offset within the page). The number is then read back with an RDMA read. If the read value matches the write value, then the barrier succeeded and there are guaranteed to be no more Send or RDMA request or response packets on that path between the requester and responder.

If the RDMA operation fails because the number read back does not match the number written, then the barrier is tried again. This could have happened because a previous response in the network came back and fulfilled the barrier.

Note that the barrier needs to be done separately on all paths the RDMA operation could have taken. That is, if the RDMA operation was being generated from multiple source ports (multipathing) and was using full link adaptivity (the packets were allowed to take any one of four possible "lanes"), then separate barrier operations must be done from each source port to each destination port, over each of the possible link adaptive paths.

The barrier operation must be done for each of the possible "lanes" between a specific Source port and Destination port. A barrier done on one VI ensures that all other VIs using that source port and destination port have no remaining request or response packets lurking in the SAN.

If a path traverses a "fat-pipe" a separate barrier must be sent down each lane of the fat pipe. SW can either blindly send four barrier operations (one for each lane) or it can maintain state, telling how many lanes are in use on the fat pipes between any given source/destination pair. The barrier need only be sent along the same path as the original request that failed. There is no requirement for the barrier to be sent from or to the same VI.

Turning now to the third step, i.e., retransmitting from the earliest packet that failed to receive a positive acknowledgement. After notification of the error, the requestor retransmits the packets starting at (or before) the [illegible] acknowledgement, [illegible] WindowSize number of packets. The responder logic acknowledges and then ignores [illegible] that are resent if they have already been stored in the receive queue. When the correct packet is reached, the responder logic can tell from the sequence number that it is [illegible] queue.

Examples of retransmission after failure to receive a response are depicted in FIGS. 8 and 9. In FIG. 8, the request packet with SEQ=2 is corrupted. The missing request is detected by the responder on the next packet and NACKed (Negative Acknowledged) and all subsequent packets are thrown away and NACKed. The requestor resets its send engine to start generating packets at the one that failed to receive an ACK (in this case Rqst. SN=2). The responder recognizes the SEQ=2 and accepts the packets.

In FIG. 9, the response packet with SEQ=1 is corrupted. The missing response is detected when the requestor times out its transaction. The requestor resets its send engine to start generating packets at the one that failed to receive an ACK (in this case Rqst. SN=1). The responder recognizes the resent packets as already having been received, acknowledges them and discards the data.

Note that the response packets for this particular RDMA transaction all have the same value of SEQ because the request SNs and response SNs are not incremented for RDMA transactions that are unordered. In this case, the TSNs are utilized by the requestor to match response packets to outstanding requests.

Error recovery places several requirements on the requestor's KA (Kernel Agent, the kernel mode driver code responsible for SAN error recovery):

1) The KA must determine the sequence number to restart with.

2) The KA must determine the proper data contents of the packet to be resent.

3) In order for the KA to determine the appropriate sequence number, it must be aware of how the hardware packetizes data under any given combination of descriptors, data segments, page crossings etc.

Note the responder side does not require KA involvement (unless a barrier operation fails).

The invention has now been described with reference to the preferred embodiments. Alternatives and substitutions will now be apparent to persons of skill in the art. For example, while the invention has been described in the context of the ServerNet II SAN, the principles of the invention are useful in any network that utilizes multiple paths between end nodes. Accordingly, it is not intended to limit the invention except as provided by the appended claims.

What is claimed is:

1. A method for error detection and recovery in a system with a plurality of networked nodes, including source and destination, communicating with each other via paths, comprising:

creating a request packet for a request transaction,

routing the request packet from the source to the
destination via a particular path;

maintaining at each of the source and destination a status
for each of its respective paths;

detecting a time-out error for failure within a predeter- 
mined time limit to receive at the source an acknowl- 
edge (ACK) packet or a negative-acknowledge 
(NACK) packet in response to the request packet, the 
time-out error created from a failure of the particular
path; 

performing a barrier transaction via the particular path to 
determine if the failure of the particular path is transient 
or permanent; 

periodically repeating the barrier transaction via the
particular path in order to determine if its failure is cured;

re-transmitting the request packet via the particular path if 
the failure is transient; and 

if the failure is permanent, 

updating at the source the status for the particular path,
and

routing to the destination information about the
updated status via an alternate path to prompt updating
at the destination of the status for the particular
path, wherein a failed path is not used.
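
As an illustration only, the time-out recovery recited above might be sketched in Python as follows. The names `PathState`, `recover`, `barrier`, `notify_peer`, and `retransmit` are hypothetical and do not appear in the specification; the three callables stand in for the hardware and Kernel Agent operations, and picking the first good alternate path is an assumption of this sketch.

```python
from enum import Enum


class PathState(Enum):
    GOOD = "good"
    BAD = "bad"


def recover(paths, failed, barrier, notify_peer, retransmit):
    """Handle a time-out on path `failed`; return the path used to retry, or None.

    paths:       dict mapping a path id to its PathState (the source's view)
    barrier:     callable(path) -> bool, True if the barrier transaction succeeds
    notify_peer: callable(via_path, bad_path) -> bool, True once the destination
                 acknowledges that it has disabled `bad_path`
    retransmit:  callable(path), resends the request packet on `path`
    """
    if barrier(failed):
        # Transient failure: the path is flushed and usable again.
        retransmit(failed)
        return failed

    # Permanent failure: update the local status first.
    paths[failed] = PathState.BAD

    # Assumption: try alternates in dictionary order and take the first good one.
    for alt, state in paths.items():
        if state is PathState.GOOD and notify_peer(alt, failed):
            retransmit(alt)
            return alt
    return None
```

Passing the operations in as callables keeps the control flow separate from any particular transport, which also makes the transient and permanent branches easy to exercise in isolation.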

2. The method of claim 1, wherein when the transaction 
is ordered the routing is deterministic preserving strict 
ordering of request, ACK and NACK packets by selecting a 
single path as the particular path. 

3. A method as in claim 1, wherein when the transaction
is not ordered the routing is link set adaptive enabling
dynamic change of request, ACK and NACK packet routing
by selection of paths from a link set.

4. A method as in claim 1, wherein the request packet is 
discarded if it is invalid. 

5. A method as in claim 1, wherein the request packet
contains a sequence number to indicate its place in a
sequence of request packets and adaptive control bits to
indicate the routing as ordered or not ordered.

6. A method as in claim 1, further comprising:
maintaining a sequence number in each of the source and
destination, the sequence numbers being initialized to a
common value; and
including the sequence number in the request packet,
wherein the sequence number is tracked by the source 
and destination, and wherein the NACK packet is 
created if a mismatch is found between the sequence 
number in the request packet and the sequence number 
at the destination. 
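
The sequence-number matching at the destination might look like the following minimal Python sketch. It assumes the 8 bit sequence number of the preferred embodiment and treats a packet whose number lies at most `window` steps behind the responder's local copy as a duplicate; that duplicate test, the `window` parameter, and the names `check_request`, `ACK`, `ACK_DISCARD`, and `NACK` are assumptions for illustration, not terms fixed by the specification.

```python
SEQ_BITS = 8              # 8 bit sequence number, per the preferred embodiment
SEQ_MOD = 1 << SEQ_BITS   # sequence numbers wrap modulo 256


def check_request(rsp_sn, seq, ordered, window):
    """Classify an incoming request packet.

    rsp_sn:  responder's local sequence number
    seq:     sequence number carried in the request packet
    ordered: True if the adaptive control bits mark the transaction as ordered
    window:  assumed maximum number of outstanding packets per VI

    Returns (action, new_rsp_sn).
    """
    behind = (rsp_sn - seq) % SEQ_MOD
    if behind == 0:
        # Expected packet: acknowledge, and advance only for ordered traffic.
        new_sn = (rsp_sn + 1) % SEQ_MOD if ordered else rsp_sn
        return "ACK", new_sn
    if behind <= window:
        # Duplicate of a packet already received: acknowledge and discard.
        return "ACK_DISCARD", rsp_sn
    # Out of sequence: an earlier packet was lost, so report a mismatch.
    return "NACK", rsp_sn
```

With the local copy at 6, for example, an ordered packet carrying 6 is acknowledged and advances the copy to 7, while a packet carrying 8 is NACKed and leaves it unchanged.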

7. A method for error detection and recovery in a system 
with a plurality of networked nodes, including source and 
destination, communicating with each other via paths, com- 
prising: 

maintaining a sequence number in each of the source and
destination, the sequence numbers being initialized to a 
common value; 

creating a request packet for a request transaction, the 
request packet containing the sequence number; 

routing the request packet from the source to the
destination via a particular path, wherein if the request
transaction is ordered, the sequence number at the
source is incremented;

maintaining at each of the source and destination a status 
for each of its respective paths; 

checking the request packet for integrity and, if the
request packet is valid, matching the sequence number
in the packet with the sequence number at the
destination;
creating an acknowledge (ACK) packet for a response
transaction if the request packet is valid and the 
sequence number matching succeeds, and routing the 
ACK packet to the source, the ACK packet containing 
the sequence number from the request packet; 

creating a negative acknowledge (NACK) packet for the
response transaction if the sequence number matching 
fails, and routing the NACK packet to the source, the 
NACK packet containing the sequence number from 
the request packet; 

incrementing the sequence number at the destination if the 
request transaction is ordered and the sequence number 
matching succeeds; 

detecting a time-out error for failure within a predeter- 
mined time limit to receive at the source the ACK or the 
NACK packets in response to the request packet, the
time-out error created from a failure of the particular 
path; 

performing a barrier transaction via the particular path to 
determine if the failure of the particular path is transient 
or permanent; 

periodically repeating the barrier transaction via the par- 
ticular path in order to determine if its failure is cured; 

re-transmitting the request packet via the particular path if 
the failure is transient; and 

if the failure is permanent, updating a status at the source 
for the particular path and routing to the destination 
information about the updated status via an alternate 
path to prompt updating of the status for the particular 
path at the destination, wherein a failed path is not used. 
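
The source-side rule, where the sequence number advances only for ordered transactions, can be sketched as below. The class name `Requester` and method `next_seq` are hypothetical, and the packet counts in the usage note are chosen only to mirror the Rqst. SN values quoted for FIG. 6; the specification does not state how many packets each transaction contains.

```python
class Requester:
    """Tracks the per-VI request sequence number at the source (a sketch)."""

    def __init__(self, seq_bits=8):
        self.mod = 1 << seq_bits  # sequence numbers wrap at 2**seq_bits
        self.sn = 0               # common initial value, zero in the preferred embodiment

    def next_seq(self, ordered):
        """Return the SEQ to place in the next request packet."""
        seq = self.sn
        if ordered:
            # Ordered packet: increment after the packet is sent.
            self.sn = (self.sn + 1) % self.mod
        # Unordered packet: the local copy is left unchanged.
        return seq
```

If six ordered Send packets are issued, the local copy moves from 0 to 6. Any number of unordered RDMA packets then all carry SEQ=6, and the first packet of the following ordered Send also carries SEQ=6 before the copy advances to 7.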

8. A method as in claim 7, wherein when the transaction 
is ordered the routing is deterministic preserving strict 
ordering of packets by selecting a single path as the par- 
ticular path. 

9. A method as in claim 7, wherein when the transaction 
is not ordered the routing is link set adaptive enabling 
dynamic change of packet routing by selection of paths from 
a path set. 

10. A method as in claim 7, wherein the request packet is 
discarded if it is invalid. 

11. A method as in claim 7, wherein the request packet 
contains adaptive control bits indicating whether the trans- 
action is ordered or not ordered. 

12. A system for error detection and recovery with a 
plurality of networked nodes, including source and 
destination, communicating with each other via paths, com- 
prising: 

means for creating a request packet for a request trans- 
action; path means for routing the request packet from 
the source to the destination via a particular path; 

means for maintaining at each of the source and destina- 
tion a status for each of its respective paths; 

means for detecting a time-out error for failure within a 
predetermined time limit to receive at the source an 
acknowledge (ACK) packet or a negative-acknowledge 
(NACK) packet in response to the request packet, the 
time-out error created from a failure of the particular 
path; 

means for performing a barrier transaction via the par- 
ticular path to determine if the failure of the particular 
path is transient or permanent, including means for 
periodically repeating the barrier transaction via the 
particular path in order to determine if its failure is 
cured; 

means for re-transmitting from the source the request 
packet via the particular path if the failure is transient; 
and 
if the failure is permanent,

means for updating at the source the status for the 
particular path, and 

means for routing to the destination information about
the updated status via an alternate path to prompt
updating at the destination of the status for the
particular path, wherein a failed path is not used.
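
The description characterizes the barrier transaction as an RDMA Write of a freshly chosen number followed by an RDMA Read of the same location, retried when the value read back does not match. A minimal Python sketch under those statements follows; `rdma_write`, `rdma_read`, `location`, and the `retries` limit are hypothetical stand-ins, and treating a `TimeoutError` as a failed barrier is an assumption of this sketch.

```python
import secrets


def barrier(rdma_write, rdma_read, location, retries=3):
    """Return True if the path is flushed, False if the barrier fails.

    rdma_write: callable(location, value), performs the RDMA Write on the path
    rdma_read:  callable(location) -> value, performs the RDMA Read on the path
    location:   the agreed fixed location at the responder
    retries:    assumed bound on mismatch retries (not given in the source)
    """
    for _ in range(retries):
        # Pseudo-random number from a large number space.
        token = secrets.randbits(64)
        try:
            rdma_write(location, token)
            if rdma_read(location) == token:
                return True
        except TimeoutError:
            # No response at all on this path: the barrier has failed.
            return False
        # Mismatch: a stale response may have answered the read, so try again.
    return False
```

Where a path traverses a fat pipe, the same function would be called once per lane, since a separate barrier must be sent down each lane.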

13. The system of claim 12, wherein when the transaction 
is ordered the routing is deterministic preserving strict 
ordering of request, ACK and NACK packets by selecting a
single path as the particular path. 

14. A system as in claim 12, wherein when the transaction 
is not ordered the routing is link set adaptive enabling 
dynamic change of request, ACK and NACK packet routing 
by selection of paths from a link set.

15. A system as in claim 12, further comprising: 
means for discarding the request packet if it is invalid. 

16. A system as in claim 12, wherein the request packet 
contains a sequence number to indicate its place in a 
sequence of request packets and adaptive control bits to
indicate the routing as ordered or not ordered. 

17. A system as in claim 12, further comprising: 
means for maintaining a sequence number in each of the
source and destination, the sequence numbers being
initialized to a common value; and
means for including the sequence number in the request
packet, wherein the sequence number is tracked by the 
source and destination, and wherein the NACK packet 
is created if a mismatch is found between the sequence 
number in the request packet and the sequence number 
at the destination. 

18. A system for error detection and recovery in a system 
with a plurality of networked nodes, including source and 
destination, communicating with each other via paths,
comprising:

means for maintaining a sequence number in each of the
source and destination, the sequence numbers being
initialized to a common value;

means for creating a request packet for a request
transaction, the request packet containing the sequence
number;

means for routing the request packet from the source to 
the destination via a particular path, including means 
for incrementing the sequence number at the source if
the request transaction is ordered; 

means for maintaining at each of the source and destina- 
tion a status for each of its respective paths;

means for checking the request packet for integrity,
including means for matching the sequence number in
the packet against the sequence number at the
destination if the request packet is valid;

means for creating an acknowledge (ACK) packet for a 
response transaction if the request packet is valid and a 
sequence number matching succeeds, including means 
for routing the ACK packet to the source, the ACK 
packet containing the sequence number from the 
request packet; 

means for creating a negative acknowledge (NACK)
packet for the response transaction if the sequence 
number matching fails, including means for routing the 
NACK packet to the source, the NACK packet con- 
taining the sequence number from the request packet; 

means for incrementing the sequence number if the 
request transaction is ordered and the sequence number 
matching succeeds; 

means for detecting a time-out error for failure within a 
predetermined time limit to receive at the source the 
ACK or the NACK packets in response to the request 
packet, the time-out error created from a failure of the
particular path; 

means for performing a barrier transaction via the par- 
ticular path to determine if the failure of the particular 
path is transient or permanent, including by periodi- 
cally repeating the barrier transaction via the particular 
path in order to determine if its failure is cured; 

means for re-transmitting the request packet via the 
particular path if the failure is transient; and 

if the failure is permanent, 

means for updating a status at the source for the
particular path, and

means for routing to the destination information about
the updated status via an alternate path to prompt
updating of the status for the particular path at the
destination, wherein a failed path is not used.

19. A system as in claim 18, wherein when the transaction
is ordered the routing is deterministic preserving strict 
ordering of request, ACK and NACK packets by selecting a 
single path as the particular path. 

20. A system as in claim 18, wherein when the transaction
is not ordered the routing is link set adaptive enabling 
dynamic change of request, ACK and NACK packet routing 
by selection of paths from a path set.

21. A system as in claim 18, further comprising: 
means for discarding the request packet if it is invalid.

22. A system as in claim 18, wherein the request packet 
contains adaptive control bits indicating whether the trans- 
action is ordered or not ordered. 